[WIP] Support DeepSeek V4 flash on SM120 with Triton fallback #40929
bbbearxyz wants to merge 25 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for the DeepSeek V4 model architecture, featuring horizontally fused MLA kernels, specialized MoE gating with softplus_sqrt, and MTP draft model integration. The changes include new CUDA and Triton kernels for optimized attention, quantization, and normalization, along with updates to Docker configurations and external dependencies like DeepGEMM and FlashMLA. Technical feedback identifies critical issues regarding the initialization of E8M0 scales, insufficient hardware-capability guards for FP8 intrinsics in CUDA kernels (which require SM89+), and a potential tensor reshape error in the Triton fallback logic.
very nice.
Cherry-picked from vllm-project#40929 commit b2a9e98. Signed-off-by: bbbearxyz <mzj1996@mail.ustc.edu.cn> Signed-off-by: jasl <jasl9187@hotmail.com>
Please reference #38476 as well. Consideration for sm80/86/89 is also appreciated, since they can use TRITON as well.
Issue: #40928
This PR is based on #40760
Tested on 2 x RTX Pro 6000 (SM120)
Summary
Support Triton fallback ops for DeepSeek V4 flash when DeepGEMM or FlashMLA is not available.
This PR adds a generic Triton implementation path for the DeepSeek V4 branch, including fallback kernels for sparse MLA attention, decode sparse attention, FP8 einsum, sparse attention indexer logits, and MHC prenorm GEMM. The existing optimized DeepGEMM / FlashMLA paths are still preferred when available; the Triton path is only used as a fallback.
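The rough shape of that backend selection is sketched below. This is an illustrative sketch only: the helper and module names are assumptions, not the exact code paths added by this PR.

```python
# Illustrative sketch of "prefer optimized backend, else Triton fallback".
# Function and module names here are hypothetical, not the PR's actual code.
import importlib.util

import torch


def _has_module(name: str) -> bool:
    """Return True if an optional dependency can be imported."""
    return importlib.util.find_spec(name) is not None


def select_mla_backend() -> str:
    """Pick an MLA backend: optimized path if usable, else Triton fallback."""
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor
    # FlashMLA / DeepGEMM do not cover SM120 today (assumption for this
    # sketch); on such devices, or when the packages are missing, fall back.
    if sm == 90 and _has_module("flash_mla"):
        return "flashmla"
    return "triton_fallback"
```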
Why
My approach to running DeepSeek V4 flash on SM120 is to provide a generic Triton implementation rather than hard-blocking execution when DeepGEMM or FlashMLA is unavailable.
I think this is a reasonable fit for the vLLM DeepSeek V4 branch: when FlashMLA or DeepGEMM does not support a device yet, vLLM should still have a portable implementation that lets users run the model. Triton gives us a more general compatibility layer across GPU architectures, including SM120 and future SM architectures.
The goal of this PR is not to replace the optimized kernels. DeepGEMM and FlashMLA should remain the preferred paths when they are supported. However, when they are unavailable, the Triton fallback gives users a working implementation, even if there is still room for performance optimization.
This also keeps the migration cost low. If DeepGEMM adds SM120 support in the future, vLLM can switch SM120 back to the DeepGEMM path with minimal changes, while still keeping Triton as a portable fallback for other unsupported architectures.
Change
This PR supports DeepSeek V4 flash on SM120 by adding a generic Triton fallback path for kernels that currently depend on DeepGEMM or FlashMLA.
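To make the "generic Triton fallback" idea concrete, here is a minimal sketch of a plain blocked Triton GEMM, assuming fp16 inputs. It is not one of the PR's actual kernels (those additionally handle FP8 scales, sparse indices, and masking), just the shape of a portable compute path that runs on any architecture Triton supports.

```python
# Minimal illustrative blocked GEMM in Triton; not the PR's fallback kernels.
import torch
import triton
import triton.language as tl


@triton.jit
def _matmul_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a_mask = (offs_m[:, None] < M) & ((k + offs_k)[None, :] < K)
        b_mask = ((k + offs_k)[:, None] < K) & (offs_n[None, :] < N)
        a = tl.load(a_ptr + offs_m[:, None] * stride_am
                    + (k + offs_k)[None, :] * stride_ak, mask=a_mask, other=0.0)
        b = tl.load(b_ptr + (k + offs_k)[:, None] * stride_bk
                    + offs_n[None, :] * stride_bn, mask=b_mask, other=0.0)
        acc += tl.dot(a, b)
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptr + offs_m[:, None] * stride_cm
             + offs_n[None, :] * stride_cn, acc.to(tl.float16), mask=c_mask)


def triton_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Portable fallback GEMM: no dependency on DeepGEMM or FlashMLA.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    _matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```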
Main changes include:
Serving benchmark
random input len: 1024
random output len: 1024
num prompts: 32
max_model_len=8192
gpu_memory_utilization=0.9
TP=2, PP=1
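For reference, these settings correspond roughly to the engine configuration below, sketched with vLLM's offline API rather than the actual serving command; the model identifier is a placeholder, not the exact checkpoint used.

```python
# Sketch of the engine settings behind the benchmark above; the model id is a
# placeholder and the offline API stands in for the serving launch command.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # placeholder model id
    tensor_parallel_size=2,                 # TP=2, PP=1
    max_model_len=8192,
    gpu_memory_utilization=0.9,
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=1024))
```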